Scalable Reliability Modelling of RAID Storage Subsystems
نویسندگان
چکیده
Reliability modelling of RAID storage systems with its various components such as RAID controllers, enclosures, expanders, interconnects and disks is important from a storage system designer’s point of view. A model that can express all the failure characteristics of the whole RAID storage system can be used to evaluate design choices, perform cost reliability trade-offs and conduct sensitivity analyses. However, including such details makes the computational models of reliability quickly infeasible. We present a CTMC reliability model for RAID storage systems that scales to much larger systems than heretofore reported and we try to model all the components as accurately as possible. We use several state-space reduction techniques at the user level, such as aggregating all in-series components and hierarchical decomposition, to reduce the size of our model. To automate computation of reliability, we use the PRISM model checker as a CTMC solver where appropriate. We use both variations of PMC − numerical as well as statistical model checking according to the size of our model. Our modelling techniques using PRISM are more practical (in both time and effort) compared to previously reported Monte-Carlo simulation techniques. Our model for RAID storage systems (that includes, for example, disks, expanders, enclosures) uses Weibull distributions for disks and, where appropriate, correlated failure modes for disks, while we use exponential distributions with independent failure modes for all other components. To use the CTMC solver, we approximate the Weibull distribution for a disk using sum of exponentials and we confirm that this model gives results that are in reasonably good agreement with those from the sequential Monte Carlo simulation methods for RAID disk subsystems reported in literature earlier. Using a combination of scalable techniques, we are able to model and compute reliability for fairly large configurations with upto 600 disks using this model.
منابع مشابه
Reliability Modelling of Whole RAID Storage Subsystems
Reliability modelling of RAID storage systems with its various components such as RAID controllers, enclosures, expanders, interconnects and disks is important from a storage system designer's point of view. A model that can express all the failure characteristics of the whole RAID storage system can be used to evaluate design choices, perform cost reliability trade-o s and conduct sensitivity ...
متن کاملECKD CAID and RAID: When is the Right Time to Write?
Traditional dual copy, CAID, and RAID DASD subsystems can offer improved data reliability in cases of actuator and/or media failures. However, these schemes impose a write penalty for the extra I/Os required to maintain image copies or parity information for data contained in the subsystem. In this paper, we will employ a review of the 3990-3/6 read and write data flows as a basis for discussin...
متن کاملRAID0.5: Active Data Replication for Low Cost Disk Array Data Protection
RAID has long been established as an effective way to provide highly reliable as well as high-performance disk subsystems. However, reliability in RAID systems comes at the cost of extra disks. In this paper, we describe a mechanism that we have termed RAID0.5 that enables striped disks with very high data reliability but low disk cost. We take advantage of the fact that most disk systems use b...
متن کاملDiskReduce: Replication as a Prelude to Erasure Coding in Data-Intensive Scalable Computing
The first generation of Data-Intensive Scalable Computing file systems such as Google File System and Hadoop Distributed File System employed n (n ≥ 3) replications for high data reliability, therefore delivering users only about 1/n of the total storage capacity of the raw disks. This paper presents DiskReduce, a framework integrating RAID into these replicated storage systems to significantly...
متن کاملDiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL-09-112)
Data-intensive file systems, developed for Internet services and popular in cloud computing, provide high reliability and availability by replicating data, typically three copies of everything. Alternatively high performance computing, which has comparable scale, and smaller scale enterprise storage systems get similar tolerance for multiple failures from lower overhead erasure encoding, or RAI...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1508.02055 شماره
صفحات -
تاریخ انتشار 2015